Abstract

We consider the problem of classification using similarity/distance functions over data.
Specifically, we propose a framework for defining the goodness of a (dis)similarity function
with respect to a given learning task and propose algorithms that have guaranteed generalization
properties when working with such good functions. Our framework unifies and generalizes the
frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability
to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show,
by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself
be learned, which makes our approach applicable to a variety of domains and problems. We propose
a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We
then provide a novel diversity based heuristic to perform task-driven selection of landmark points
instead of random selection. We demonstrate the effectiveness of our goodness criteria learning
method as well as the landmark selection heuristic on a variety of similarity-based learning
datasets and benchmark UCI datasets on which our method consistently outperforms existing
approaches by a significant margin.
1 Introduction



Machine learning algorithms have found applications in diverse domains such as computer vision, 
bio-informatics and speech recognition. Working in such heterogeneous domains often involves 
handling data that is not presented as explicit features embedded into vector spaces. However, in
many domains, for example co-authorship graphs, it is natural to devise similarity/distance functions
over pairs of points. While classical techniques like decision trees and linear perceptrons cannot handle
such data, several modern machine learning algorithms such as support vector machines (SVMs) can
be kernelized and are thereby capable of using kernels or similarity functions.

However, most of these algorithms require the similarity functions to be positive semi-definite 
(PSD), which essentially implies that the similarity stems from an (implicit) embedding of the data 
into a Hilbert space. Unfortunately in many domains, the most natural notion of similarity does not 
satisfy this condition - moreover, verifying this condition is usually a non-trivial exercise. Take for 
example the case of images on which the most natural notions of distance (Euclidean, Earth-mover) 
[3] do not form PSD kernels. Co-authorship graphs give another such example.

Consequently, there have been efforts to develop algorithms that do not make assumptions about 
the PSD-ness of the similarity functions used. One can discern three main approaches in this area. 
The first approach tries to coerce a given similarity measure into a PSD one by either clipping or 
shifting the spectrum of the kernel matrix [4, 5]. However, these approaches are mostly restricted to
transductive settings and are not applicable to large scale problems due to eigenvector computation 
requirements. The second approach consists of algorithms that either adapt classical methods like 
k-NN to handle non-PSD similarity/distance functions and consequently offer slow test times [5],
or are forced to solve non-convex formulations [6, 7].

The third approach, which has been investigated recently in a series of papers [1, 2, 8, 9], uses the
similarity function to embed the domain into a low dimensional Euclidean space. More specifically, 
these algorithms choose landmark points in the domain which then give the embedding. Assuming 
a certain "goodness" property (that is formally defined) for the similarity function, these models 
offer both generalization guarantees in terms of how well-suited the similarity function is to the 
classification task as well as the ability to use fast algorithmic techniques such as linear SVM [10]
on the landmarked space. The model proposed by Balcan-Blum in [1] gives sufficient conditions for
a similarity function to be well suited to such a landmarking approach. Wang et al. in [2], on the other
hand, provide goodness conditions for dissimilarity functions that enable landmarking algorithms.

Informally, a similarity (or distance) function can be said to be good if points of similar labels 
are closer to each other than points of different labels in some sense. Both the models described 
above restrict themselves to a fixed goodness criterion, which need not hold for the underlying data. 
We observe that this might be too restrictive in many situations and present a framework that al- 
lows us to tune the goodness criterion itself to the classification problem at hand. Our framework 
consequently unifies and generalizes those presented in [1] and [2]. We first prove generalization
bounds corresponding to landmarked embeddings under a fixed goodness criterion. We then pro- 
vide a uniform-convergence bound that enables us to learn the best goodness criterion for a given 
problem. We further generalize our framework by giving the ability to incorporate any Lipschitz 
loss function into our goodness criterion which allows us to give guarantees for the use of various 
algorithms such as C-SVM and logistic regression on the landmarked space. 

Now, similar to [1, 2], our framework requires random sampling of training points to create the
embedding space^1. However, in practice, random sampling is inefficient and requires sampling of a
large number of points to form a useful embedding, thereby increasing training and test time. To
address this issue, [2] proposes a heuristic to select the points that are to be used as landmarks.
However their scheme is tied to their optimization algorithm and is computationally inefficient for 
large scale data. In contrast, we propose a general heuristic for selecting informative landmarks 
based on a novel notion of diversity which can then be applied to any instantiation of our model. 

Finally, we apply our methods to a variety of benchmark datasets for similarity learning as well as
ones from the UCI repository. We empirically demonstrate that our learning model and landmark
selection heuristic consistently offer significant improvements over the existing approaches. In
particular, for a small number of landmark points, which is a practically important scenario as it is
expensive to compute similarity function values at test time, our method provides, on average,
accuracy boosts of up to 5% over existing methods. We also note that our methods can be applied on
top of any strategy used to learn the similarity measure (e.g. MKL techniques [11]) or the distance
measure (e.g. [12]) itself. Akin to [1], our techniques can also be extended to learn a combination of
(dis)similarity functions but we do not explore these extensions in this paper.

2 Methodology 

Let $\mathcal{D}$ be a fixed but unknown distribution over the labeled input domain $\mathcal{X}$ and let
$\ell : \mathcal{X} \to \{-1, +1\}$ be a labeling over the domain. Given a (potentially non-PSD) similarity
function^2 $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the goal is to learn a classifier
$\hat{\ell} : \mathcal{X} \to \{-1, +1\}$ from a finite number of i.i.d. samples from $\mathcal{D}$ that has
bounded generalization error over $\mathcal{D}$.

Now, learning a reasonable classifier seems unlikely if the given similarity function does not have 
any inherent "goodness" property. Intuitively, the goodness of a similarity function should be its 
suitability to the classification task at hand. For PSD kernels, the notion of goodness is defined 
in terms of the margin offered in the RKHS [13]. However, a more basic requirement is that the
similarity function should preserve affinities among similarly labeled points - that is to say, a good 
similarity function should not, on an average, assign higher similarity values to dissimilarly labeled 
points than to similarly labeled points. This intuitive notion of goodness turns out to be rather robust
in the sense that all PSD kernels that offer a good margin in their respective RKHSs satisfy some
form of this goodness criterion as well [14].

1 Throughout the paper, we use the terms embedding space and landmarked space interchangeably.
2 Results described in this section hold for distance functions as well; we present results with respect to
similarity functions for sake of simplicity.

Recently there has been some interest in studying different realizations of this general notion of
goodness and developing corresponding algorithms that allow for efficient learning with
similarity/distance functions. Balcan-Blum in [1] present a goodness criterion in which a good similarity
function is considered to be one that, for most points, assigns a greater average similarity to
similarly labeled points than to dissimilarly labeled points. More specifically, a similarity function is
$(\epsilon, \gamma)$-good if there exists a weighing function $w : \mathcal{X} \to \mathbb{R}$ such that at least
a $(1 - \epsilon)$ probability mass of examples $x \sim \mathcal{D}$ satisfies

$$\mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x,x') \mid \ell(x') = \ell(x)\right] \ge \mathbb{E}_{x' \sim \mathcal{D}}\left[w(x')K(x,x') \mid \ell(x') \ne \ell(x)\right] + \gamma, \tag{1}$$

where instead of average similarity, one considers an average weighted similarity to allow the
definition to be more general.

Wang et al. in [2] define a distance function $d$ to be good if a large fraction of the domain is, on
average, closer to similarly labeled points than to dissimilarly labeled points. They allow these
averages to be calculated based on some distribution distinct from $\mathcal{D}$, one that may be more suited
to the learning problem. However, it turns out that their definition is equivalent to one in which one
again assigns weights to domain elements, as done by [1], and the following holds:

$$\mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x')w(x'')\,\mathrm{sgn}\left(d(x,x'') - d(x,x')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \ge \gamma \tag{2}$$

Assuming their respective goodness criteria, [1] and [2] provide efficient algorithms to learn
classifiers with bounded generalization error. However, these notions of goodness with a single fixed
criterion may be too restrictive in the sense that the data and the (dis)similarity function may not
satisfy the underlying criterion. This is, for example, likely in situations with high intra-class variance.
Thus there is a need to make the goodness criterion more flexible and data-dependent.

To this end, we unify and generalize both the above criteria to give a notion of goodness that is more
data dependent. Although the above goodness criteria (1) and (2) seem disparate at first, they can
be shown to be special cases of a generalized framework where an antisymmetric function is used
to compare intra- and inter-class affinities. We use this observation to define our novel goodness
criterion using arbitrary bounded antisymmetric functions which we refer to as transfer functions.
This allows us to define a family of goodness criteria of which (1) and (2) form special cases ((1)
uses the identity function and (2) uses the sign function as transfer function). Moreover, the resulting
definition of a good similarity function is more flexible and data dependent. In the rest of the paper
we shall always assume that our similarity functions are normalized, i.e. for the domain of interest
$\mathcal{X}$, $\sup_{x,y \in \mathcal{X}} K(x, y) \le 1$.

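Definition 1 (Good Similarity Function). A similarity function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be
an $(\epsilon, \gamma, B)$-good similarity for a learning problem, where $\epsilon, \gamma, B > 0$, if for some antisymmetric
transfer function $f : \mathbb{R} \to \mathbb{R}$ and some weighing function $w : \mathcal{X} \times \mathcal{X} \to [-B, B]$, at least a
$(1 - \epsilon)$ probability mass of examples $x \sim \mathcal{D}$ satisfies

$$\mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \ge C_f \gamma \tag{3}$$

where $C_f = \sup_{x,x' \in \mathcal{X}} f(K(x,x')) - \inf_{x,x' \in \mathcal{X}} f(K(x,x'))$.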

As mentioned before, the above goodness criterion generalizes the previous notions of goodness 
and is adaptive to changes in data as it allows us, as shall be shown later, to learn the best possible 
criterion for a given classification task by choosing the most appropriate transfer function from a 
parameterized family of functions. We stress that the property of antisymmetry for the transfer 
function is crucial to the definition in order to provide a uniform treatment to points of all classes as 
will be evident in the proof of Theorem 2.
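The criterion in Equation 3 can be probed empirically on a labeled sample. The following is a minimal sketch, not the procedure used in this paper: it assumes unit weights $w \equiv 1$, a toy similarity `K` on the real line, and a user-chosen `threshold` standing in for $C_f \gamma$. The helper names `empirical_affinity` and `fraction_good` are hypothetical. Passing the identity recovers the flavour of criterion (1), and passing the sign function that of criterion (2).

```python
import math

def K(x, y):
    # Hypothetical normalized similarity on the real line (an assumption).
    return math.exp(-abs(x - y))

def sign(t):
    return (t > 0) - (t < 0)

def empirical_affinity(i, points, labels, f):
    """Average of f(K(x, x') - K(x, x'')) over same-label x' and other-label x''."""
    x = points[i]
    same = [points[j] for j in range(len(points)) if j != i and labels[j] == labels[i]]
    other = [points[j] for j in range(len(points)) if labels[j] != labels[i]]
    total = sum(f(K(x, a) - K(x, b)) for a in same for b in other)
    return total / (len(same) * len(other))

def fraction_good(points, labels, f, threshold):
    """Fraction of sample points whose empirical affinity reaches the threshold."""
    hits = sum(empirical_affinity(i, points, labels, f) >= threshold
               for i in range(len(points)))
    return hits / len(points)

points = [1.0, 1.2, 0.8, -1.0, -0.9, -1.1]
labels = [1, 1, 1, -1, -1, -1]

frac_identity = fraction_good(points, labels, lambda t: t, 0.5)  # identity transfer
frac_sign = fraction_good(points, labels, sign, 0.5)             # sign transfer
```

One minus the returned fraction plays the role of $\epsilon$ for the chosen threshold; a different transfer function can change which points count as good.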

As in [1, 2], our goodness criterion lends itself to a simple learning algorithm which consists of
choosing a set of $d$ random pairs of points from the domain, $\mathcal{P} = \{(x_i^+, x_i^-)\}_{i=1}^d$ (which we refer to
as landmark pairs), and defining an embedding of the domain into a landmarked space using these
landmarks: $\Phi_L : \mathcal{X} \to \mathbb{R}^d$, $\Phi_L(x) = \left(f(K(x, x_i^+) - K(x, x_i^-))\right)_{i=1}^d \in \mathbb{R}^d$. The advantage of
performing this embedding is the guaranteed existence of a large margin classifier in the landmarked
space as shown below.

3 We refer the reader to the appendix for a discussion.
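The embedding into the landmarked space takes only a few lines to compute. The sketch below is illustrative: the similarity `toy_similarity`, the choice of `math.tanh` as the antisymmetric transfer function, and the helper name `embed` are all assumptions rather than choices made in this paper.

```python
import math

def toy_similarity(x, y):
    """Hypothetical normalized similarity on real numbers, with values in (0, 1]."""
    return math.exp(-abs(x - y))

def embed(x, landmark_pairs, K, f):
    """Map x to the vector (f(K(x, x_i^+) - K(x, x_i^-))) for i = 1..d."""
    return [f(K(x, xp) - K(x, xn)) for xp, xn in landmark_pairs]

# Landmark pairs (x_i^+, x_i^-): positives lie near +1, negatives near -1.
pairs = [(1.0, -1.0), (1.5, -0.5), (0.8, -1.2)]

phi_pos = embed(1.1, pairs, toy_similarity, math.tanh)   # a point near the positives
phi_neg = embed(-0.9, pairs, toy_similarity, math.tanh)  # a point near the negatives

# Antisymmetry of f: swapping the roles within each pair negates every coordinate.
swapped = embed(1.1, [(xn, xp) for xp, xn in pairs], toy_similarity, math.tanh)
```

On this toy data every coordinate of `phi_pos` is positive and every coordinate of `phi_neg` is negative, so even a uniform-weight linear classifier separates the two points in the landmarked space.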

Theorem 2. If $K$ is an $(\epsilon, \gamma, B)$-good similarity with respect to transfer function $f$ and weight
function $w$ then for any $\epsilon_1 > 0$, with probability at least $1 - \delta$ over the choice of
$d = (8/\gamma^2)\ln(2/\delta\epsilon_1)$ positive and negative samples, $\{x_i^+\}_{i=1}^d \subset \mathcal{D}^+$ and
$\{x_i^-\}_{i=1}^d \subset \mathcal{D}^-$ respectively, the classifier $h(x) = \mathrm{sgn}[g(x)]$ where
$g(x) = \frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x, x_i^+) - K(x, x_i^-)\right)$ has error no more
than $\epsilon + \epsilon_1$ at margin $\gamma/2$.

Proof. We shall prove that with probability at least $1 - \delta$, at least a $1 - \epsilon_1$ fraction of points $x$ that
satisfy Equation 3 are classified correctly by the classifier $h(x)$. Overestimating the error by treating
the points that do not satisfy Equation 3 as always being misclassified will give us the desired result.

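For any fixed $x \in \mathcal{X}^+$ that satisfies Equation 3, we have

$$\mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = 1, \ell(x'') = -1\right] \ge C_f \gamma,$$

hence the Hoeffding bounds give us

$$\Pr\left[g(x) < \frac{\gamma}{2}\right] = \Pr\left[\frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x,x_i^+) - K(x,x_i^-)\right) < \frac{\gamma}{2}\right] \le 2\exp\left(-\frac{\gamma^2 d}{8}\right).$$

Similarly, for any fixed $x \in \mathcal{X}^-$ that satisfies Equation 3, we have

$$\mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = -1, \ell(x'') = 1\right] \ge C_f \gamma,$$

hence the Hoeffding bounds give us

$$\Pr\left[g(x) > -\frac{\gamma}{2}\right] = \Pr\left[\frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x,x_i^+) - K(x,x_i^-)\right) > -\frac{\gamma}{2}\right] = \Pr\left[\frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x,x_i^-) - K(x,x_i^+)\right) < \frac{\gamma}{2}\right] \le 2\exp\left(-\frac{\gamma^2 d}{8}\right)$$

where in the second step we have used antisymmetry of $f$.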

Since we have shown that this result holds true individually for any point $x$ that satisfies Equation 3,
the expected error (where the expectation is both over the choice of domain points as well as choice
of the landmark points) itself turns out to be less than $2\exp\left(-\frac{\gamma^2 d}{8}\right) \le \epsilon_1\delta$. Applying Markov's
inequality gives us that the probability of obtaining a set of landmarks such that the error on points
satisfying Equation 3 is greater than $\epsilon_1$ is at most $\delta$.

Assuming, as mentioned earlier, that the points not satisfying Equation 3 can always be misclassified
proves our desired result. □
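To get a feel for the number of landmark pairs Theorem 2 asks for, the expression $d = (8/\gamma^2)\ln(2/\delta\epsilon_1)$ can be evaluated directly. The sketch below is only an illustration: the function names and the parameter values are assumptions, and `hoeffding_tail` evaluates the per-point bound $2\exp(-\gamma^2 d/8)$ that this choice of $d$ drives below $\epsilon_1\delta$.

```python
import math

def landmarks_needed(gamma, delta, eps1):
    """d = (8 / gamma^2) * ln(2 / (delta * eps1)), rounded up to an integer."""
    return math.ceil((8.0 / gamma ** 2) * math.log(2.0 / (delta * eps1)))

def hoeffding_tail(gamma, d):
    """Per-point failure bound 2 * exp(-gamma^2 * d / 8)."""
    return 2.0 * math.exp(-(gamma ** 2) * d / 8.0)

# Illustrative values: margin 0.1, confidence and extra error both 5%.
d = landmarks_needed(gamma=0.1, delta=0.05, eps1=0.05)
tail = hoeffding_tail(0.1, d)
```

The dependence on the margin is quadratic: halving $\gamma$ roughly quadruples the required number of landmark pairs, which is one reason a small landmark budget matters in practice.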



However, there are two hurdles to obtaining this large margin classifier. Firstly, the existence of this 
classifier itself is predicated on the use of the correct transfer function, something which is unknown. 
Secondly, even if an optimal transfer function is known, the above formulation cannot be converted 
into an efficient learning algorithm for discovering the (unknown) weights since the formulation 
seeks to minimize the number of misclassifications which is an intractable problem in general. 

We overcome these two hurdles by proposing a nested learning problem. First of all we assume 
that for some fixed loss function L, given any transfer function and any set of landmark pairs, it is 
possible to obtain a large margin classifier in the corresponding landmarked space that minimizes L. 
Having made this assumption, we address below the issue of learning the optimal transfer function 
for a given learning task. However as we have noted before, this assumption is not valid for arbitrary 
loss functions. This is why, subsequently in Section 2.2, we shall show it to be valid for a large class
of loss functions by incorporating surrogate loss functions into our goodness criterion. 
2.1 Learning the transfer function



In this section we present results that allow us to learn a near optimal transfer function from a family 
of transfer functions. We shall assume, for some fixed loss function L, the existence of an efficient 
routine which we refer to as TRAIN that shall return, for any landmarked space indexed by a set of 
landmark pairs V, a large margin classifier minimizing L. The routine TRAIN is allowed to make 
use of additional training data to come up with this classifier. 

An immediate algorithm for choosing the best transfer function is to simply search the set of pos- 
sible transfer functions (in an algorithmically efficient manner) and choose the one offering lowest 
training error. We show here that given enough landmark pairs, this simple technique, which we 
refer to as FTUNE (see Algorithm 2), is guaranteed to return a near-best transfer function. For this
we prove a uniform convergence type guarantee on the space of transfer functions. 
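The search over transfer functions described above can be sketched as follows. This is a hedged illustration rather than the implementation evaluated in this paper: the family $f_a(t) = \tanh(at)$, the toy similarity `K`, and in particular the stand-in `train` routine (which uses unit weights and scores them by mean hinge loss instead of actually minimizing the loss) are all assumptions made to keep the example self-contained.

```python
import math

def K(x, y):
    return math.exp(-abs(x - y))  # hypothetical normalized similarity

def make_transfer(a):
    # Antisymmetric, bounded family f_a(t) = tanh(a * t); 'a' is a slope parameter.
    return lambda t: math.tanh(a * t)

def embed(x, pairs, f):
    return [f(K(x, xp) - K(x, xn)) for xp, xn in pairs]

def train(pairs, f, data):
    """Stand-in for TRAIN: fixes unit weights and reports mean hinge loss at margin 1."""
    d = len(pairs)
    w = [1.0] * d
    loss = 0.0
    for x, y in data:
        g = sum(wi * v for wi, v in zip(w, embed(x, pairs, f))) / d
        loss += max(0.0, 1.0 - y * g)
    return w, loss / len(data)

def ftune(family, pairs, data):
    """Return (parameter, weights, loss) of the transfer function with lowest training loss."""
    best = None
    for a in family:
        w, loss = train(pairs, make_transfer(a), data)
        if best is None or loss < best[2]:
            best = (a, w, loss)
    return best

data = [(1.2, 1), (0.7, 1), (-0.8, -1), (-1.3, -1)]   # (point, label)
pairs = [(1.0, -1.0), (0.5, -0.5)]                    # landmark pairs (x^+, x^-)
best_a, best_w, best_loss = ftune([0.5, 1.0, 2.0, 5.0, 10.0], pairs, data)
```

On this separable toy sample the steepest transfer function wins because it pushes every point furthest from the decision boundary; with a real TRAIN routine the weights would be fitted for each candidate before the losses are compared.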

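Let $\mathcal{F} \subset [-1, 1]^{\mathbb{R}}$ be a class of antisymmetric functions and $\mathcal{W} = [-B, B]^{\mathcal{X} \times \mathcal{X}}$ be a class of weight
functions. For two real valued functions $f$ and $g$ defined on $\mathcal{X}$, let $\|f - g\|_\infty := \sup_{x \in \mathcal{X}} |f(x) - g(x)|$.
Let $B_\infty(f, r) := \{f' \in \mathcal{F} \mid \|f - f'\|_\infty < r\}$. Let $L$ be a $C_L$-Lipschitz loss function. Let
$\mathcal{P} = \{(x_i^+, x_i^-)\}_{i=1}^d$ be a set of (random) landmark pairs. For any $f \in \mathcal{F}$, $w \in \mathcal{W}$, define

$$G_{(f,w)}(x) = \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right]$$

$$g_{(f,w)}(x) = \frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x,x_i^+) - K(x,x_i^-)\right)$$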

Theorem 7 (see Section 2.2) guarantees us that for any fixed $f$ and any $\epsilon_1 > 0$, if $d$ is large enough
then $\mathbb{E}_x\left[L(g_{(f,w)}(x))\right] \le \mathbb{E}_x\left[L(G_{(f,w)}(x))\right] + \epsilon_1$. We now show that a similar result holds even if
one is allowed to vary $f$. Before stating the result, we develop some notation.

For any transfer function $f$ and arbitrary choice of landmark pairs $\mathcal{P}$, let $w_{(g,f)}$ be the best
weighing function for this choice of transfer function and landmark pairs, i.e. let
$w_{(g,f)} = \arg\min_{w \in [-B,B]^d} \mathbb{E}_{x \sim \mathcal{D}}\left[L(g_{(f,w)}(x))\right]$ ^4. Similarly, let $w_{(G,f)}$ be the best weighing function
corresponding to $G$, i.e. $w_{(G,f)} = \arg\min_{w \in \mathcal{W}} \mathbb{E}_{x \sim \mathcal{D}}\left[L(G_{(f,w)}(x))\right]$. Then we can ensure the following:

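Theorem 3. Let $\mathcal{F}$ be a compact class of transfer functions with respect to the infinity norm and
$\epsilon_1, \delta > 0$. Let $N(\mathcal{F}, r)$ be the size of the smallest $\epsilon$-net over $\mathcal{F}$ with respect to the infinity norm at
scale $r = \frac{\epsilon_1}{4 C_L B}$. Then if one chooses $d = \frac{64 B^2 C_L^2}{\epsilon_1^2} \ln\left(\frac{8B \cdot N(\mathcal{F}, r)}{\delta\epsilon_1}\right)$ random landmark pairs then
we have the following with probability greater than $(1 - \delta)$:

$$\sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] - \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \right| \le \epsilon_1$$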



We shall prove the theorem in two parts. As we shall see, one of the parts is fairly simple to prove.
To prove the other part, we shall exploit the Lipschitz properties of the loss function as well as
the fact that the class of transfer functions chosen forms a compact set. Let us call a given set of
landmark pairs good with respect to a fixed transfer function $f \in \mathcal{F}$ if for the corresponding
$g$, $\mathbb{E}_x\left[L(g(x))\right] \le \mathbb{E}_x\left[L(G(x))\right] + \epsilon_1$ for some small fixed $\epsilon_1 > 0$.

We will first prove, using Lipschitz properties of the loss function, that if a given set of landmarks is
good with respect to a given transfer function, then it is also good with respect to all transfer
functions in its neighborhood. Having proved this, we will apply a standard covering number argument
in which we will ensure that a large enough set of landmarks is good with respect to a set of transfer
functions that form an $\epsilon$-net over $\mathcal{F}$ and use the previous result to complete the proof.

We first prove a series of simple results which will be used in the first part of the proof. In the
following, $f$ and $f'$ are two transfer functions such that $f' \in B_\infty(f, r) \cap \mathcal{F}$.

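Lemma 4. The following results are true:

1. For any fixed $f \in \mathcal{F}$, $\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f,w)}(x)\right)\right]$ for all $w \in \mathcal{W}$.

2. For any fixed $f \in \mathcal{F}$, any fixed $g$ obtained by an arbitrary choice of landmark pairs,
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f,w)}(x)\right)\right]$ for all $w \in \mathcal{W}$.

3. For any $f' \in B_\infty(f, r) \cap \mathcal{F}$,
$\left| \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] - \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] \right| \le C_L r B$.

4. For any fixed $g$ obtained by an arbitrary choice of landmark pairs, $f' \in B_\infty(f, r) \cap \mathcal{F}$,
$\left| \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] - \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f', w_{(g,f')})}(x)\right)\right] \right| \le C_L r B$.

4 Note that the function $g_{(f,w)}(x)$ is dictated by the choice of the set of landmark pairs $\mathcal{P}$.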



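Proof. We prove the results in order.

1. Immediate from the definition of $w_{(G,f)}$.

2. Immediate from the definition of $w_{(g,f)}$.

3. We have $\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f)})}(x)\right)\right]$ by an application of
Lemma 4.1 proven above. For sake of simplicity let us denote $w_{(G,f)} = w$ for the next set
of calculations. Now we have

$$\begin{aligned}
G_{(f',w)}(x) &= \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f'\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \\
&\le \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\left(f\left(K(x,x') - K(x,x'')\right) + r\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \\
&= \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \\
&\quad + r \cdot \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'') \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right] \\
&\le G_{(f,w)}(x) + rB
\end{aligned}$$

where in the second step we have used the fact that $\|f - f'\|_\infty \le r$ and in the fourth
step we have used the fact that $w \in \mathcal{W}$. Thus we have $G_{(f',w)}(x) \le G_{(f,w)}(x) + rB$.
Using the Lipschitz properties of $L$ we can now get
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f',w)}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f,w)}(x)\right)\right] + C_L r B$. Thus we have
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] + C_L r B$.
Similarly we can also prove
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] + C_L r B$.
This gives us the desired result.

4. The proof follows in a manner similar to the one for Lemma 4.3 proven above.

□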



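Using the above results we get a preliminary form of the first part of our proof as follows:

Lemma 5. Suppose a set of landmarks is $(\epsilon_1/2)$-good for a particular transfer function $f \in \mathcal{F}$
(i.e. $\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] + \epsilon_1/2$), then the same set of
landmarks are also $\epsilon_1$-good for any $f' \in B_\infty(f, r) \cap \mathcal{F}$ (i.e. for all $f' \in B_\infty(f, r) \cap \mathcal{F}$,
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f', w_{(g,f')})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] + \epsilon_1$) for some $r = r(\epsilon_1)$.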

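Proof. Theorem 7 proven below guarantees that for any fixed $f \in \mathcal{F}$, with probability $1 - \delta$,
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] + \epsilon_1/2$. This can be achieved with
$d = (64 B^2 C_L^2/\epsilon_1^2)\ln(8B/\delta\epsilon_1)$. Now assuming that the above holds, using the above results we can get
the following for any $f' \in B_\infty(f, r) \cap \mathcal{F}$:

$$\begin{aligned}
\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f', w_{(g,f')})}(x)\right)\right] &\le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] + C_L r B && \text{(using Lemma 4.4)} \\
&\le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(G,f)})}(x)\right)\right] + C_L r B && \text{(using Lemma 4.2)} \\
&\le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] + \epsilon_1/2 + C_L r B && \text{(using Theorem 7)} \\
&\le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f', w_{(G,f')})}(x)\right)\right] + \epsilon_1/2 + 2 C_L r B && \text{(using Lemma 4.3)}
\end{aligned}$$

Setting $r = \frac{\epsilon_1}{4 C_L B}$ gives us the desired result.

□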



Proof (of Theorem 3). As mentioned earlier we shall prove the theorem in two parts as follows:

1. (Part I) In this part we shall prove the following:

$$\sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] - \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \right) \le \epsilon_1$$

We first set up an $\epsilon$-net over $\mathcal{F}$ at scale $r = \frac{\epsilon_1}{4 C_L B}$. Let there be $N(\mathcal{F}, r)$ elements in this
net. Taking $d = (64 B^2 C_L^2/\epsilon_1^2)\ln(8B \cdot N(\mathcal{F}, r)/\delta\epsilon_1)$ landmarks should ensure that the
landmarks, with very high probability, are good for all functions in the net by an application
of the union bound. Since every function in $\mathcal{F}$ is $r$-close to some function in the net,
Lemma 5 tells us that the same set of landmarks is, with very high probability, good for
all the functions in $\mathcal{F}$. This proves the first part of our result.



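2. (Part II) In this part we shall prove the following:

$$\sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] - \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right] \right) \le \epsilon_1$$

This part is actually fairly simple to prove. Intuitively, since one can imagine $G$ as being
the output of an algorithm that is allowed to take the entire domain as its landmark set,
we should expect $\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right]$ to hold
unconditionally for every $f$. For a formal argument, let us build up some more notation. As we
have said before, for any transfer function $f$ and arbitrary choice of $d$ landmark pairs $\mathcal{P}$, we
let $w_{(g,f)} \in [-B, B]^d$ be the best weighing function for this choice of transfer function and
landmark pairs. Now let $\bar{w}_{(g,f)}$ be the best possible extension of $w_{(g,f)}$ to the entire
domain. More formally, for any $w^* \in [-B, B]^d$ let
$\bar{w}^* = \arg\min_{w \in \mathcal{W},\, w|_{\mathcal{P}} = w^*} \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f,w)}(x)\right)\right]$.

Now Lemma 4.1 tells us that for any $f \in \mathcal{F}$ and any choice of landmark pairs $\mathcal{P}$,
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, w_{(G,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, \bar{w}_{(g,f)})}(x)\right)\right]$. Furthermore, since $\bar{w}_{(g,f)}$ is
chosen to be the most beneficial extension of $w_{(g,f)}$, we also have
$\mathbb{E}_{x \sim \mathcal{D}}\left[L\left(G_{(f, \bar{w}_{(g,f)})}(x)\right)\right] \le \mathbb{E}_{x \sim \mathcal{D}}\left[L\left(g_{(f, w_{(g,f)})}(x)\right)\right]$. Together, these two inequalities
give us the second part of the proof.

□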



This result tells us that in a large enough landmarked space, we shall, for each function $f \in \mathcal{F}$,
recover close to the best classifier possible for that transfer function. Thus, if we iterate over the 
set of transfer functions (or use some gradient-descent based optimization routine), we are bound to 
select a transfer function that is capable of giving a classifier that is close to the best. 
2.2 Working with surrogate loss functions



The formulation of a good similarity function suggests a simple learning algorithm that involves 
the construction of an embedding of the domain into a landmarked space on which the existence 
of a large margin classifier having low misclassification rate is guaranteed. However, in order to 
exploit this guarantee we would have to learn the weights $w(x_i^+, x_i^-)$ associated with this classifier
by minimizing the empirical misclassification rate on some training set.

Unfortunately, not only is this problem intractable but it is also hard to solve approximately [15, 16].
Thus what we require is for the landmarked space to admit a classifier that has low error with
respect to a loss function that can also be efficiently minimized on any training set. In such a
situation, minimizing the loss on a random training set would, with very high probability, give us
weights that give similar performance guarantees as the ones used in the goodness criterion.

With a similar objective in mind, [1] offers variants of its goodness criterion tailored to the hinge loss
function which can be efficiently optimized on large training sets (for example LIBSVM [17]). Here
we give a general notion of goodness that can be tailored to any arbitrary Lipschitz loss function.

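Definition 6. A similarity function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be an $(\epsilon, B)$-good similarity for
a learning problem with respect to a loss function $L : \mathbb{R} \to \mathbb{R}^+$, where $\epsilon > 0$, if for some transfer
function $f : \mathbb{R} \to \mathbb{R}$ and some weighing function $w : \mathcal{X} \times \mathcal{X} \to [-B, B]$, $\mathbb{E}_{x \sim \mathcal{D}}\left[L(G(x))\right] \le \epsilon$ where

$$G(x) = \mathbb{E}_{x',x'' \sim \mathcal{D} \times \mathcal{D}}\left[w(x',x'')\, f\left(K(x,x') - K(x,x'')\right) \mid \ell(x') = \ell(x), \ell(x'') \ne \ell(x)\right]$$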

One can see that taking the loss function as $L(x) = \mathbb{1}_{x < C_f \gamma}$ gives us Equation 3 which defines a
good similarity under the 0-1 loss function. It turns out that we can, for any Lipschitz loss function,
give similar guarantees on the performance of the classifier in the landmarked space.

Theorem 7. If $K$ is an $(\epsilon, B)$-good similarity function with respect to a $C_L$-Lipschitz loss
function $L$ then for any $\epsilon_1 > 0$, with probability at least $1 - \delta$ over the choice of
$d = (16 B^2 C_L^2/\epsilon_1^2)\ln(4B/\delta\epsilon_1)$ positive and negative samples from $\mathcal{D}^+$ and $\mathcal{D}^-$ respectively, the
expected loss of the classifier $g(x)$ with respect to $L$ satisfies $\mathbb{E}_x\left[L(g(x))\right] \le \epsilon + \epsilon_1$ where
$g(x) = \frac{1}{d}\sum_{i=1}^d w(x_i^+, x_i^-)\, f\left(K(x, x_i^+) - K(x, x_i^-)\right)$.



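Proof. For any $x \in \mathcal{X}$, we have, by an application of Hoeffding bounds,
$\Pr_g\left[|G(x) - g(x)| > \epsilon_1\right] \le 2\exp\left(-\frac{\epsilon_1^2 d}{2B^2}\right)$ since $|g(x)| \le B$. Here the notation $\Pr_g$ signifies that the
probability is over the choice of the landmark points. Thus for $d > \frac{4B^2}{\epsilon_1^2}\ln\left(\frac{2}{\delta}\right)$, we have
$\Pr_g\left[|G(x) - g(x)| > \epsilon_1\right] < \delta^2$.

For sake of simplicity let us denote by $\mathrm{BAD}(x)$ the event $|G(x) - g(x)| > \epsilon_1$. Thus we have, for
every $x \in \mathcal{X}$, $\mathbb{E}_g\left[\mathbb{1}_{\mathrm{BAD}(x)}\right] < \delta^2$. Since this is true for every $x \in \mathcal{X}$, this also holds in expectation, i.e.
$\mathbb{E}_x \mathbb{E}_g\left[\mathbb{1}_{\mathrm{BAD}(x)}\right] < \delta^2$. The expectation over $x$ is with respect to the problem distribution $\mathcal{D}$. Applying
Fubini's Theorem gives us $\mathbb{E}_g \mathbb{E}_x\left[\mathbb{1}_{\mathrm{BAD}(x)}\right] < \delta^2$ which upon application of Markov's inequality gives
us $\Pr_g\left[\mathbb{E}_x\left[\mathbb{1}_{\mathrm{BAD}(x)}\right] > \delta\right] < \delta$. Thus, with very high probability we would always choose landmarks
such that $\Pr_x\left[\mathrm{BAD}(x)\right] \le \delta$. Thus we have, in such a situation,
$\mathbb{E}_x\left[|G(x) - g(x)|\right] \le (1 - \delta)\epsilon_1 + \delta \cdot 2B$ since $\sup_{x \in \mathcal{X}} |G(x) - g(x)| \le 2B$. For small enough $\delta$ we
have $\mathbb{E}_x\left[|G(x) - g(x)|\right] \le 2\epsilon_1$.

Thus we have $\mathbb{E}_x\left[L(g(x))\right] - \mathbb{E}_x\left[L(G(x))\right] = \mathbb{E}_x\left[L(g(x)) - L(G(x))\right] \le \mathbb{E}_x\left[C_L \cdot |g(x) - G(x)|\right] =
C_L \cdot \mathbb{E}_x\left[|g(x) - G(x)|\right] \le 2 C_L \epsilon_1$ where we used the Lipschitz properties of the loss function $L$ to
arrive at the second step. Putting $\epsilon_1 = \frac{\epsilon_1'}{2 C_L}$ we have $\mathbb{E}_x\left[L(g(x))\right] \le \mathbb{E}_x\left[L(G(x))\right] + \epsilon_1' \le \epsilon + \epsilon_1'$
which gives us our desired result.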



Algorithm 1 DSELECT

Require: A training set $T$, landmarking size $d$.
Ensure: A set of $d$ landmark pairs/singletons.

    $L$ <- get-random-element($T$), $P_{FTUNE}$ <- $\emptyset$
    for $j = 2$ to $d$ do
        $z$ <- $\arg\min_{x \in T} \sum_{x' \in L} K(x, x')$
        $L$ <- $L \cup \{z\}$, $T$ <- $T \setminus \{z\}$
    end for
    for $j = 1$ to $d$ do
        Sample $z_1, z_2$ s.t. $\ell(z_1) = 1$, $\ell(z_2) = -1$ randomly from $L$ with replacement
        $P_{FTUNE}$ <- $P_{FTUNE} \cup \{(z_1, z_2)\}$
    end for
    return $L$ (for BBS), $P_{FTUNE}$ (for FTUNE)

Algorithm 2 FTUNE

Require: A family of transfer functions $\mathcal{F}$, a similarity function $K$ and a loss function $L$.
Ensure: An optimal transfer function $f^* \in \mathcal{F}$.

    Select $d$ landmark pairs $\mathcal{P}$.
    for all $f \in \mathcal{F}$ do
        $w_f$ <- TRAIN($\mathcal{P}$, $L$), $L_f$ <- $L(w_f)$
    end for
    $f^*$ <- $\arg\min_{f \in \mathcal{F}} L_f$
    return $(f^*, w_{f^*})$.

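Actually we can prove something stronger since

$$\left| \mathbb{E}_x\left[L(g(x))\right] - \mathbb{E}_x\left[L(G(x))\right] \right| = \left| \mathbb{E}_x\left[L(g(x)) - L(G(x))\right] \right| \le \mathbb{E}_x\left[\left|L(g(x)) - L(G(x))\right|\right] \le \mathbb{E}_x\left[C_L \cdot |g(x) - G(x)|\right] \le \epsilon_1'.$$

Thus we have $\epsilon - \epsilon_1' \le \mathbb{E}_x\left[L(g(x))\right] \le \epsilon + \epsilon_1'$.
□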



If the loss function is hinge loss at margin $\gamma$ then $C_L = \frac{1}{\gamma}$. The 0-1 loss function and the loss
function $L(x) = \mathbb{1}_{x < \gamma}$ (implicitly used in Definition 1 and Theorem 2) are not Lipschitz and hence
this proof technique does not apply to them.
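For a concrete feel of Theorem 7 under the hinge loss, the sketch below evaluates $d = (16 B^2 C_L^2/\epsilon_1^2)\ln(4B/\delta\epsilon_1)$ with $C_L = 1/\gamma$ and numerically spot-checks the Lipschitz property. The function names and the parameter values are illustrative assumptions, and the hinge loss is written here as $\max(0, 1 - t/\gamma)$.

```python
import math

def hinge_at_margin(t, gamma):
    """Hinge loss at margin gamma: max(0, 1 - t / gamma)."""
    return max(0.0, 1.0 - t / gamma)

def landmarks_for_lipschitz_loss(B, C_L, eps1, delta):
    """d = (16 B^2 C_L^2 / eps1^2) * ln(4B / (delta * eps1)), rounded up."""
    lead = 16.0 * B ** 2 * C_L ** 2 / eps1 ** 2
    return math.ceil(lead * math.log(4.0 * B / (delta * eps1)))

gamma = 0.1
C_L = 1.0 / gamma  # Lipschitz constant of the hinge loss at margin gamma
d = landmarks_for_lipschitz_loss(B=1.0, C_L=C_L, eps1=0.05, delta=0.05)

# Spot-check |L(a) - L(b)| <= C_L * |a - b| on a small grid of scores.
grid = [i / 20.0 - 1.0 for i in range(41)]
lipschitz_ok = all(
    abs(hinge_at_margin(a, gamma) - hinge_at_margin(b, gamma)) <= C_L * abs(a - b) + 1e-9
    for a in grid for b in grid
)
```

With these values the bound asks for several million landmark pairs, which illustrates why the worst-case guarantee is conservative and why a task-driven landmark selection heuristic is attractive in practice.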

2.3 Selecting informative landmarks 

Recall that the generalization guarantees we described in the previous section rely on random se- 
lection of landmark pairs from a fixed distribution over the domain. However, in practice, a totally 
random selection might require one to select a large number of landmarks, thereby leading to an 
inefficient classifier in terms of training as well as test times. For typical domains such as computer 
vision, similarity function computation is an expensive task and hence selection of a small number 
of landmarks should lead to a significant improvement in the test times. For this reason, we propose
a landmark pair selection heuristic which we call DSELECT (see Algorithm 1). The heuristic
generalizes naturally to multi-class problems and can also be applied to the classification model of 
Balcan-Blum that uses landmark singletons instead of pairs. 

At the core of our heuristic is a novel notion of diversity among landmarks. Assuming $K$ is a normalized
similarity kernel, we call a set of points $S \subset X$ diverse if the average inter-point similarity
is small, i.e. $\frac{1}{|S|(|S|-1)} \sum_{x,y \in S,\, x \neq y} K(x, y) \ll 1$ (in case we are working with a distance kernel we
would require large inter-point distances). The key observation behind DSELECT is that a non-diverse
set of landmarks would cause all data points to receive identical embeddings and linear
separation would be impossible. Small inter-landmark similarity, on the other hand, would imply
that the landmarks are well-spread in the domain and can capture novel patterns in the data.
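The diversity measure described above is straightforward to compute. The sketch below evaluates the average inter-point similarity of a set and, purely for illustration, grows a diverse subset greedily by always adding the candidate that is least similar to the points chosen so far. The greedy rule, the function names and the toy Gaussian similarity are assumptions and are not claimed to reproduce DSELECT.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def average_similarity(points: Sequence[T], K: Callable[[T, T], float]) -> float:
    """Average of K(x, y) over ordered pairs of distinct points: sum / (|S| (|S| - 1))."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    total = sum(K(points[i], points[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def greedy_diverse_subset(points: Sequence[T], K: Callable[[T, T], float], k: int) -> List[T]:
    """Illustrative greedy rule (an assumption): start from the first point, then
    repeatedly add the point whose summed similarity to the chosen set is smallest."""
    chosen = [0]
    while len(chosen) < min(k, len(points)):
        remaining = [i for i in range(len(points)) if i not in chosen]
        nxt = min(remaining, key=lambda i: sum(K(points[i], points[j]) for j in chosen))
        chosen.append(nxt)
    return [points[i] for i in chosen]


# Toy data on the real line with a normalized Gaussian similarity.
K = lambda x, y: math.exp(-((x - y) ** 2))
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
clustered = points[:3]
diverse = greedy_diverse_subset(points, K, 3)
```

On this toy data the clustered triple has average similarity close to 1, while the greedily chosen triple spreads over the three clusters and has average similarity close to 0.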

Similar notions of diversity have been used in the past for ensemble classifiers [18] and k-NN classifiers
[5]. Here we use this notion to achieve a better embedding into the landmarked space. Experimental
results demonstrate that the heuristic offers significant performance improvements over
random landmark selection (see Figure 1). One can easily extend Algorithm 1 to multi-class
problems by selecting a fixed number of landmarks from each class.



3 Empirical results 

In this section, we empirically study the performance of our proposed methods on a variety of benchmark
datasets. We refer to the algorithmic formulation presented in [1] as BBS and its augmentation
using DSELECT as BBS+D. We refer to the formulation presented in [2] as DBOOST. We refer to
our transfer function learning based formulation as FTUNE and its augmentation using DSELECT 
as FTUNE+D. In multi-class classification scenarios we will use a one-vs-all formulation which
presents us with an opportunity to further exploit the transfer function by learning a separate transfer
function per class (i.e. per one-vs-all problem). We shall refer to our formulation using a single
(resp. multiple) transfer function as FTUNE+D-S (resp. FTUNE+D-M). We take the class of ramp
functions indexed by a slope parameter as our set of transfer functions. We use 6 different values
of the slope parameter {1, 5, 10, 50, 100, 1000}. Note that these functions (approximately) include
both the identity function (used by [1]) and the sign function (used by [2]).

Dataset/Method   BBS          DBOOST       FTUNE+D-S
AmazonBinary     0.73(0.13)   0.77(0.10)   0.84(0.12)
AuralSonar       0.82(0.08)   0.81(0.08)   0.80(0.08)
Patrol           0.51(0.06)   0.34(0.11)   0.58(0.06)
Voting           0.95(0.03)   0.94(0.03)   0.94(0.04)
Protein          0.98(0.02)   1.00(0.01)   0.98(0.02)
Mirex07          0.12(0.01)   0.21(0.03)   0.28(0.03)
Amazon47         0.39(0.06)   0.07(0.04)   0.61(0.08)
FaceRec          0.20(0.04)   0.12(0.03)   0.63(0.04)

(a) 30 Landmarks

Dataset/Method   BBS          DBOOST       FTUNE+D-S
AmazonBinary     0.78(0.11)   0.82(0.10)   0.88(0.07)
AuralSonar       0.88(0.06)   0.85(0.07)   0.85(0.07)
Patrol           0.79(0.05)   0.55(0.12)   0.79(0.07)
Voting           0.97(0.02)   0.97(0.01)   0.97(0.02)
Protein          0.98(0.02)   0.99(0.02)   0.98(0.02)
Mirex07          0.17(0.02)   0.31(0.04)   0.35(0.02)
Amazon47         0.40(0.13)   0.07(0.05)   0.66(0.07)
FaceRec          0.27(0.05)   0.19(0.03)   0.64(0.04)

(b) 300 Landmarks

Table 1: Accuracies for Benchmark Similarity Learning Datasets for Embedding Dimensionality=30, 300.
Bold numbers indicate the best performance with 95% confidence level.
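The ramp transfer functions used in these experiments can be sketched as follows. The exact form of the ramp is not given in the text; the sketch assumes $f_s(t) = \max(-1, \min(1, s \cdot t))$, which behaves like the identity on $[-1, 1]$ for slope 1 and approaches the sign function as the slope grows. The names `ramp`, `SLOPES` and `family` are hypothetical.

```python
from typing import Callable, Dict


def ramp(slope: float) -> Callable[[float], float]:
    """Ramp transfer function, assumed to be t -> clip(slope * t, -1, 1)."""
    def f(t: float) -> float:
        return max(-1.0, min(1.0, slope * t))
    return f


# The six slope values used in the experiments.
SLOPES = [1, 5, 10, 50, 100, 1000]
family: Dict[int, Callable[[float], float]] = {s: ramp(s) for s in SLOPES}

identity_like = family[1](0.3)   # slope 1 leaves values in [-1, 1] unchanged
sign_like = family[1000](0.3)    # slope 1000 saturates almost immediately
```

Under this assumption the family interpolates between the two existing models, which is what lets validation pick an intermediate transfer function when neither extreme suits the data.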

Our goal in this section is two-fold: 1) to show that our FTUNE method is able to learn a more 
suitable transfer function for the underlying data than the existing methods BBS and DBOOST and 
2) to show that our diversity based heuristic for landmark selection performs better than random 
selection. To this end, we perform experiments on a few benchmark datasets for learning with similarity
(non-PSD) functions [5] as well as on a variety of standard UCI datasets where the similarity
function used is the Gaussian kernel function. 

For our experiments, we implemented our methods FTUNE and FTUNE+D as well as BBS and
BBS+D using MATLAB while using LIBLINEAR [17] for SVM classification. For DBOOST, we
use the C++ code provided by the authors of [2]. On all the datasets we randomly selected a fixed
percentage of data for training, validation and testing. Except for DBOOST, we selected the SVM
penalty constant C from the set {1, 10, 100, 1000} using validation. For each method and dataset, we
report classification accuracies averaged over 20 runs. We compare accuracies obtained by different
methods using a t-test at the 95% confidence level.
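To make the significance comparison concrete, the sketch below computes a two-sample t statistic from the reported mean and standard deviation of two methods over 20 runs each. It assumes the bracketed numbers in the tables are sample standard deviations over independent runs and uses the pooled (equal group size) statistic; the critical value 2.024 is the two-sided 95% quantile of Student's t with 38 degrees of freedom. The function names are hypothetical and this is not claimed to be the exact test procedure the authors ran.

```python
import math


def two_sample_t(mean1: float, std1: float, mean2: float, std2: float, n: int) -> float:
    """Two-sample t statistic for two groups of equal size n."""
    standard_error = math.sqrt((std1 ** 2 + std2 ** 2) / n)
    return (mean1 - mean2) / standard_error


# Two-sided 95% critical value of Student's t for 2 * 20 - 2 = 38 degrees of freedom.
T_CRIT_DF38 = 2.024


def significantly_different(mean1: float, std1: float, mean2: float, std2: float, n: int = 20) -> bool:
    return abs(two_sample_t(mean1, std1, mean2, std2, n)) > T_CRIT_DF38


# Values copied from Table 1(a): FTUNE+D-S versus BBS.
amazon47 = significantly_different(0.61, 0.08, 0.39, 0.06)
voting = significantly_different(0.94, 0.04, 0.95, 0.03)
```

Under these assumptions the Amazon47 gap is significant (t is roughly 9.8), whereas the Voting gap is not (t is roughly 0.9 in magnitude), so both methods would count among the best on Voting.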



3.1 Similarity learning datasets 



First, we conduct experiments on a few similarity learning datasets [5]; these datasets provide a
(non-PSD) similarity matrix along with class labels. For each of the datasets, we randomly select
70% of the data for training, 10% for validation and the remaining 20% for testing. We then
apply our FTUNE-S, FTUNE+D-S and BBS+D methods, along with BBS and DBOOST, with a varying
number of landmark pairs. We do not apply our FTUNE-M method to these datasets because they are
typically small and FTUNE-M overfits heavily on them.

We first compare the accuracy achieved by FTUNE+D-S with that of the existing methods. Table 1 compares
the accuracies achieved by our FTUNE+D-S method with those of BBS and DBOOST over
different datasets when using landmark sets of sizes 30 and 300. Numbers in brackets denote the standard
deviation over different runs. Note that in both tables FTUNE+D-S is one of the best
methods (up to the 95% significance level) on all but one dataset. Furthermore, for datasets with a large
number of classes, such as Amazon47 and FaceRec, our method outperforms BBS and DBOOST by
at least 20%. Also, note that some of the datasets have multiple bold-faced methods, which
means that the two-sample t-test (at the 95% level) does not find a significant difference between their means.

Next, we evaluate the effectiveness of our landmark selection criterion for both BBS and our method.
Figure 1 shows the accuracies achieved by various methods on four different datasets with an increasing
number of landmarks. Note that on all the datasets, our diversity-based landmark selection criterion
increases the classification accuracy by around 5-6% for a small number of landmarks.

[Figure 1. Panels (Accuracy vs Landmarks): Amazon47, FaceRec. x-axis: Number of Landmarks. Legend: FTUNE+D, FTUNE, BBS+D, BBS, DBOOST.]

Figure 1: Accuracy obtained by various methods on four different datasets as the number of landmarks
used increases. Note that for a small number of landmarks (30, 50) our diversity-based landmark
selection criterion increases accuracy for both BBS and our method FTUNE-S significantly.



Dataset/Method   BBS          DBOOST       FTUNE-S      FTUNE-M
Cod-rna          0.93(0.01)   0.89(0.01)   0.93(0.01)   0.93(0.01)
Isolet           0.81(0.01)   0.67(0.01)   0.84(0.01)   0.83(0.01)
Letters          0.67(0.02)   0.58(0.01)   0.69(0.01)   0.68(0.02)
Magic            0.82(0.01)   0.81(0.01)   0.84(0.01)   0.84(0.01)
Pen-digits       0.94(0.01)   0.93(0.01)   0.97(0.01)   0.97(0.00)
Nursery          0.91(0.01)   0.91(0.01)   0.90(0.01)   0.90(0.00)
Faults           0.70(0.01)   0.68(0.02)   0.70(0.02)   0.71(0.02)
Mfeat-pixel      0.94(0.01)   0.91(0.01)   0.95(0.01)   0.94(0.01)
Mfeat-zernike    0.79(0.02)   0.72(0.02)   0.79(0.02)   0.79(0.02)
Opt-digits       0.92(0.01)   0.89(0.01)   0.94(0.01)   0.94(0.01)
Satellite        0.85(0.01)   0.86(0.01)   0.86(0.01)   0.87(0.01)
Segment          0.90(0.01)   0.93(0.01)   0.92(0.01)   0.92(0.01)

(a) 30 Landmarks

Dataset/Method   BBS          DBOOST       FTUNE-S      FTUNE-M
Cod-rna          0.94(0.00)   0.93(0.00)   0.94(0.00)   0.94(0.00)
Isolet           0.91(0.01)   0.89(0.01)   0.93(0.01)   0.93(0.00)
Letters          0.72(0.01)   0.84(0.01)   0.83(0.01)   0.83(0.01)
Magic            0.84(0.01)   0.84(0.00)   0.85(0.01)   0.85(0.01)
Pen-digits       0.96(0.00)   0.99(0.00)   0.99(0.00)   0.99(0.00)
Nursery          0.93(0.01)   0.97(0.00)   0.96(0.00)   0.97(0.00)
Faults           0.72(0.02)   0.74(0.02)   0.73(0.02)   0.73(0.02)
Mfeat-pixel      0.96(0.01)   0.97(0.01)   0.97(0.01)   0.97(0.01)
Mfeat-zernike    0.81(0.01)   0.79(0.01)   0.82(0.02)   0.82(0.01)
Opt-digits       0.95(0.01)   0.97(0.00)   0.98(0.00)   0.98(0.00)
Satellite        0.85(0.01)   0.90(0.01)   0.89(0.01)   0.89(0.01)
Segment          0.90(0.01)   0.96(0.01)   0.96(0.01)   0.96(0.01)

(b) 300 Landmarks

Table 2: Accuracies for Gaussian Kernel for Embedding Dimensionality=30, 300. Bold numbers indicate
the best performance with 95% confidence level. Note that both our methods, especially FTUNE-S,
perform significantly better than the existing methods.



3.2 UCI benchmark datasets 



We now compare our FTUNE method against existing methods on a variety of UCI datasets [19].
We ran experiments with FTUNE and FTUNE+D but the latter did not provide any advantage. So
for lack of space we drop it from our presentation and only show results for FTUNE-S (FTUNE with
a single transfer function) and FTUNE-M (FTUNE with one transfer function per class). As in
earlier work, we use the Gaussian kernel function as the similarity function for evaluating our method.
We set the "width" parameter in the Gaussian kernel to be the mean of all pair-wise training data
distances, a standard heuristic. For all the datasets, we randomly select 50% of the data for training, 20%
for validation and the remaining 30% for testing. We report accuracy values averaged over 20 runs for
each method with a varying number of landmark pairs.
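The width heuristic described above can be sketched as follows. The text does not state the exact parameterisation of the Gaussian kernel, so the form $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ used here is an assumption, as are the function names and the toy training points.

```python
import math
from itertools import combinations
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(x: Point, y: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_pairwise_distance(points: Sequence[Point]) -> float:
    """Mean Euclidean distance over all unordered pairs of training points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(euclidean(x, y) for x, y in pairs) / len(pairs)


def gaussian_kernel(width: float) -> Callable[[Point, Point], float]:
    """K(x, y) = exp(-||x - y||^2 / (2 * width^2)); this parameterisation is an assumption."""
    def K(x: Point, y: Point) -> float:
        return math.exp(-euclidean(x, y) ** 2 / (2.0 * width ** 2))
    return K


train = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # pairwise distances 5, 10 and 5
width = mean_pairwise_distance(train)
K = gaussian_kernel(width)
```

Tying the width to the data scale keeps the similarities away from the degenerate regimes where every pair looks identical or every pair looks unrelated.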

Table 2 compares the accuracies obtained by our FTUNE-S and FTUNE-M methods with those of
BBS and DBOOST when applied to different UCI benchmark datasets. Note that FTUNE-S is one
of the best on most of the datasets for both landmarking sizes. Also, BBS performs reasonably
well for small landmarking sizes while DBOOST performs well for large landmarking sizes. In
contrast, our method performs consistently well in both scenarios.

Next, we study accuracies obtained by our method for different landmarking sizes. Figure 2 shows
accuracies obtained by various methods as the number of landmarks selected increases. Note that
the accuracy curve of our method dominates the accuracy curves of all the other methods, i.e. our
method is consistently better than the existing methods for all the landmarking sizes considered.



3.3 Discussion 



We note that since FTUNE selects its output by way of validation, it is susceptible to over-fitting on
small datasets but, at the same time, capable of giving performance boosts on large ones. We observe
a similar trend in our experiments: on smaller datasets (such as those in Table 1, with average dataset
size 660), FTUNE over-fits and performs worse than BBS and DBOOST. However, even in these
cases, DSELECT (intuitively) removes redundancies in the landmark points, thus allowing FTUNE
to recover the best transfer function. In contrast, for larger datasets like those in Table 2 (average
size 13200), FTUNE is itself able to recover better transfer functions than the baseline methods
and hence both FTUNE-S and FTUNE-M perform significantly better than the baselines. Note that
DSELECT is not able to provide any advantage here: since the dataset sizes are large, greedy
selection actually ends up hurting the accuracy.

Figure 2: Accuracy achieved by various methods on four different UCI repository datasets as the
number of landmarks used increases. Note that both FTUNE-S and FTUNE-M perform significantly
better than BBS and DBOOST for a small number of landmarks (30, 50).


A Comparison with the models of Balcan-Blum and Wang et al

In [2], Wang et al. consider a model of learning with distance functions. Their model is similar to ours
except that they restrict themselves to the use of a single transfer function, namely the
sign function $f = \mathrm{sgn}(\cdot)$. More formally, they have the following notion of a good distance function.

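Definition 8 ([1], Definition 4). A distance function $d : X \times X \to \mathbb{R}$ is said to be an $(\epsilon, \gamma, B)$-good
distance for a learning problem where $\epsilon, \gamma, B > 0$ if there exist two class conditional probability
distributions $\tilde{\mathcal{D}}(x \mid \ell(x) = 1)$ and $\tilde{\mathcal{D}}(x \mid \ell(x) = -1)$ such that for all $x \in X$,
$\frac{\tilde{\mathcal{D}}(x \mid \ell(x) = 1)}{\mathcal{D}(x \mid \ell(x) = 1)} < \sqrt{B}$ and $\frac{\tilde{\mathcal{D}}(x \mid \ell(x) = -1)}{\mathcal{D}(x \mid \ell(x) = -1)} < \sqrt{B}$,
where $\mathcal{D}(x \mid \ell(x) = 1)$ and $\mathcal{D}(x \mid \ell(x) = -1)$ are the class conditional
probability distributions of the problem, such that at least a $1 - \epsilon$ probability mass of examples
$x \sim \mathcal{D}$ satisfies

$$\Pr_{x', x'' \sim \tilde{\mathcal{D}} \times \tilde{\mathcal{D}}} \left[ d(x, x') < d(x, x'') \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge \frac{1}{2} + \gamma \qquad (4)$$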

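It can be shown (and is implicit in the proof of Theorem 5 in [2]) that the above condition is equivalent
to

$$\mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w_{\ell(x)}(x')\, w_{-\ell(x)}(x'')\, \mathrm{sgn}\left(d(x, x'') - d(x, x')\right) \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge 2\gamma$$

where $w_1(x) := \frac{\tilde{\mathcal{D}}(x \mid \ell(x) = 1)}{\mathcal{D}(x \mid \ell(x) = 1)}$ and $w_{-1}(x) := \frac{\tilde{\mathcal{D}}(x \mid \ell(x) = -1)}{\mathcal{D}(x \mid \ell(x) = -1)}$. Now define $w(x', x'') :=
w_{\ell(x')}(x')\, w_{\ell(x'')}(x'')$ and take $f = \mathrm{sgn}(\cdot)$ as the transfer function in our model. We have, for a
$1 - \epsilon$ fraction of points,

$$\mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w(x', x'')\, f\left(K(x, x') - K(x, x'')\right) \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge C_f \gamma$$

which is clearly seen to be equivalent to

$$\mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w_{\ell(x)}(x')\, w_{-\ell(x)}(x'')\, \mathrm{sgn}\left(K(x, x') - K(x, x'')\right) \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge \gamma$$

since $C_f = 1$ for the $\mathrm{sgn}(\cdot)$ function. Thus the Wang et al. model of learning is an instantiation of
our proposed model.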

In [1], Balcan-Blum present a model of learning with similarity functions. Their model does not
consider landmark pairs, just singletons. Accordingly, instead of assigning a weight to each landmark
pair, one simply assigns a weight to each element of the domain. Consequently one arrives at
the following notion of a good similarity.

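Definition 9 ([2], Definition 3). A similarity measure $K : X \times X \to \mathbb{R}$ is said to be an $(\epsilon, \gamma)$-good
similarity for a learning problem where $\epsilon, \gamma > 0$ if for some weighting function $w : X \to [-1, 1]$, at
least a $1 - \epsilon$ probability mass of examples $x \sim \mathcal{D}$ satisfies

$$\mathbb{E}_{x' \sim \mathcal{D}} \left[ w(x') K(x, x') \mid \ell(x') = \ell(x) \right] \ge \mathbb{E}_{x' \sim \mathcal{D}} \left[ w(x') K(x, x') \mid \ell(x') \neq \ell(x) \right] + \gamma \qquad (5)$$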






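Now define $w_+ := \mathbb{E}_{x \sim \mathcal{D}}[w(x) \mid \ell(x) = 1]$ and $w_- := \mathbb{E}_{x \sim \mathcal{D}}[w(x) \mid \ell(x) = -1]$. Furthermore, take
$w(x', x'') = w(x') w(x'')$ as the weight function and $f = \mathrm{id}(\cdot)$ as the transfer function in our model.
Then we have, for a $1 - \epsilon$ fraction of the points,

$$\begin{aligned}
& \mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w(x', x'')\, f\left(K(x, x') - K(x, x'')\right) \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge C_f \gamma \\
\equiv\ & \mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w(x', x'') \left(K(x, x') - K(x, x'')\right) \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \ge \gamma \\
\equiv\ & \mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w(x', x'') K(x, x') \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] \\
& \qquad \ge \mathbb{E}_{x', x'' \sim \mathcal{D} \times \mathcal{D}} \left[ w(x', x'') K(x, x'') \mid \ell(x') = \ell(x),\ \ell(x'') \neq \ell(x) \right] + \gamma \\
\equiv\ & w_{-\ell(x)}\, \mathbb{E}_{x' \sim \mathcal{D}} \left[ w(x') K(x, x') \mid \ell(x') = \ell(x) \right] \ge w_{\ell(x)}\, \mathbb{E}_{x' \sim \mathcal{D}} \left[ w(x') K(x, x') \mid \ell(x') \neq \ell(x) \right] + \gamma \\
\equiv\ & \mathbb{E}_{x' \sim \mathcal{D}} \left[ w'(x') K(x, x') \mid \ell(x') = \ell(x) \right] \ge \mathbb{E}_{x' \sim \mathcal{D}} \left[ w'(x') K(x, x') \mid \ell(x') \neq \ell(x) \right] + \gamma
\end{aligned}$$

where $C_f = 1$ for the $\mathrm{id}(\cdot)$ function and $w'(x) = w(x)\, w_{-\ell(x)}$. Note that this again guarantees a
classifier with margin $\gamma$ in the landmarked space. Thus the Balcan-Blum model can also be derived
in our model.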
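The two instantiations discussed in this appendix differ only in the transfer function applied to the difference of similarities. A minimal sketch of the resulting landmarked embedding is given below; it maps a point $x$ to one coordinate $f(K(x, x') - K(x, x''))$ per landmark pair and omits the weights $w(x', x'')$, which are left to the linear classifier trained on top. The function names, the toy similarity and the toy landmark pairs are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def sgn(t: float) -> float:
    """Sign transfer function (the Wang et al. instantiation)."""
    return float((t > 0) - (t < 0))


def identity(t: float) -> float:
    """Identity transfer function (the Balcan-Blum instantiation)."""
    return t


def embed(
    x: float,
    landmark_pairs: Sequence[Tuple[float, float]],
    K: Callable[[float, float], float],
    f: Callable[[float], float],
) -> List[float]:
    """One coordinate per landmark pair (x_pos, x_neg): f(K(x, x_pos) - K(x, x_neg))."""
    return [f(K(x, x_pos) - K(x, x_neg)) for x_pos, x_neg in landmark_pairs]


# Toy similarity on [0, 1]; positives sit near 0 and negatives near 1.
K = lambda a, b: 1.0 - abs(a - b)
pairs = [(0.1, 0.9), (0.2, 0.7)]

near_positive_sgn = embed(0.15, pairs, K, sgn)
near_negative_sgn = embed(0.8, pairs, K, sgn)
near_positive_id = embed(0.15, pairs, K, identity)
```

With the sign function each coordinate only records which landmark of the pair is closer, while the identity keeps the size of the gap; intermediate transfer functions trade off between the two.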